Use SQLAlchemy with the UCSC Database

We are going to use SQLAlchemy first to pull the list of available tables, and then to build an ORM class for the table we care about, in this case snp147.

To do this we will be making use of SQLAlchemy's automapping functionality, which creates ORM classes directly from tables. Unfortunately, it fails for a number of UCSC tables because they don't have detectable primary keys, which the ORM requires. Because of this we will first inspect the database and dump a list of tables, and then explicitly declare a class for the table we care about. SQLAlchemy will fill in all the extra columns for us, so we only need to declare the class and its primary key; everything else is done for us.


In [1]:
import pandas as pd
from sqlalchemy import inspect
from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session
from sqlalchemy.sql import select
from sqlalchemy.ext.automap import automap_base

In [2]:
# Connect to the hg19 database
engine = create_engine("mysql+pymysql://genome@genome-mysql.cse.ucsc.edu/{organism}?charset=utf8mb4".format(organism='hg19'))
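
The same URL template works for other assemblies (just swap the organism, e.g. 'hg38'), and it can be worth a quick connectivity check before reflecting anything. A minimal sketch, not part of the original session:

from sqlalchemy import text

# Confirm the public UCSC server is reachable before doing any reflection
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())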

Get the list of tables


In [3]:
# Get the list of tables
inspector = inspect(engine)
tables = inspector.get_table_names()

This list is very long. You could just read through it, but instead we are going to filter it for tables whose names begin with 'snp'.


In [4]:
len(tables)


Out[4]:
11048

In [5]:
snp_tables = [i for i in tables if i.startswith('snp')]

In [6]:
sorted(snp_tables)


Out[6]:
['snp138',
 'snp138CodingDbSnp',
 'snp138Common',
 'snp138ExceptionDesc',
 'snp138Flagged',
 'snp138Mult',
 'snp138OrthoPt4Pa2Rm3',
 'snp138Seq',
 'snp141',
 'snp141CodingDbSnp',
 'snp141Common',
 'snp141ExceptionDesc',
 'snp141Flagged',
 'snp141OrthoPt4Pa2Rm3',
 'snp141Seq',
 'snp142',
 'snp142CodingDbSnp',
 'snp142Common',
 'snp142ExceptionDesc',
 'snp142Flagged',
 'snp142Mult',
 'snp142OrthoPt4Pa2Rm3',
 'snp142Seq',
 'snp144',
 'snp144CodingDbSnp',
 'snp144Common',
 'snp144ExceptionDesc',
 'snp144Flagged',
 'snp144Mult',
 'snp144OrthoPt4Pa2Rm3',
 'snp144Seq',
 'snp146',
 'snp146CodingDbSnp',
 'snp146Common',
 'snp146ExceptionDesc',
 'snp146Flagged',
 'snp146Mult',
 'snp146OrthoPt4Pa2Rm3',
 'snp146Seq',
 'snp147',
 'snp147CodingDbSnp',
 'snp147Common',
 'snp147ExceptionDesc',
 'snp147Flagged',
 'snp147Mult',
 'snp147OrthoPt4Pa2Rm3',
 'snp147Seq',
 'snpArrayAffy250Nsp',
 'snpArrayAffy250Sty',
 'snpArrayAffy5',
 'snpArrayAffy6',
 'snpArrayAffy6SV',
 'snpArrayIllumina1M',
 'snpArrayIllumina1MRaw',
 'snpArrayIllumina300',
 'snpArrayIllumina550',
 'snpArrayIllumina650',
 'snpArrayIlluminaHuman660W_Quad',
 'snpArrayIlluminaHuman660W_QuadRaw',
 'snpArrayIlluminaHumanCytoSNP_12',
 'snpArrayIlluminaHumanCytoSNP_12Raw',
 'snpArrayIlluminaHumanOmni1_Quad',
 'snpArrayIlluminaHumanOmni1_QuadRaw']

OK, so let's use snp147.

Get Columns and Build ORM

We are now going to build the ORM class, so let's check the actual columns of the table.


In [7]:
inspector.get_columns('snp147')


Out[7]:
[{'autoincrement': False,
  'default': None,
  'name': 'bin',
  'nullable': False,
  'type': SMALLINT(display_width=5, unsigned=True)},
 {'default': None,
  'name': 'chrom',
  'nullable': False,
  'type': VARCHAR(length=31)},
 {'autoincrement': False,
  'default': None,
  'name': 'chromStart',
  'nullable': False,
  'type': INTEGER(display_width=10, unsigned=True)},
 {'autoincrement': False,
  'default': None,
  'name': 'chromEnd',
  'nullable': False,
  'type': INTEGER(display_width=10, unsigned=True)},
 {'default': None,
  'name': 'name',
  'nullable': False,
  'type': VARCHAR(length=15)},
 {'autoincrement': False,
  'default': None,
  'name': 'score',
  'nullable': False,
  'type': SMALLINT(display_width=5, unsigned=True)},
 {'default': None,
  'name': 'strand',
  'nullable': False,
  'type': ENUM('+', '-')},
 {'default': None, 'name': 'refNCBI', 'nullable': False, 'type': BLOB()},
 {'default': None, 'name': 'refUCSC', 'nullable': False, 'type': BLOB()},
 {'default': None,
  'name': 'observed',
  'nullable': False,
  'type': VARCHAR(length=255)},
 {'default': None,
  'name': 'molType',
  'nullable': False,
  'type': ENUM('unknown', 'genomic', 'cDNA')},
 {'default': None,
  'name': 'class',
  'nullable': False,
  'type': ENUM('single', 'in-del', 'microsatellite', 'named', 'mnp', 'insertion', 'deletion')},
 {'default': None, 'name': 'valid', 'nullable': False, 'type': SET(length=15)},
 {'default': None, 'name': 'avHet', 'nullable': False, 'type': FLOAT()},
 {'default': None, 'name': 'avHetSE', 'nullable': False, 'type': FLOAT()},
 {'default': None, 'name': 'func', 'nullable': False, 'type': SET(length=14)},
 {'default': None,
  'name': 'locType',
  'nullable': False,
  'type': ENUM('range', 'exact', 'between', 'rangeInsertion', 'rangeSubstitution', 'rangeDeletion', 'fuzzy')},
 {'autoincrement': False,
  'default': None,
  'name': 'weight',
  'nullable': False,
  'type': INTEGER(display_width=10, unsigned=True)},
 {'default': None,
  'name': 'exceptions',
  'nullable': False,
  'type': SET(length=26)},
 {'autoincrement': False,
  'default': None,
  'name': 'submitterCount',
  'nullable': False,
  'type': SMALLINT(display_width=5, unsigned=True)},
 {'default': None,
  'name': 'submitters',
  'nullable': False,
  'type': LONGBLOB()},
 {'autoincrement': False,
  'default': None,
  'name': 'alleleFreqCount',
  'nullable': False,
  'type': SMALLINT(display_width=5, unsigned=True)},
 {'default': None, 'name': 'alleles', 'nullable': False, 'type': LONGBLOB()},
 {'default': None, 'name': 'alleleNs', 'nullable': False, 'type': LONGBLOB()},
 {'default': None,
  'name': 'alleleFreqs',
  'nullable': False,
  'type': LONGBLOB()},
 {'default': None,
  'name': 'bitfields',
  'nullable': False,
  'type': SET(length=33)}]
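
If you only need the names and types, a compact listing (a small convenience, not in the original notebook) is easier to scan:

# Print just the column names and their MySQL types
for col in inspector.get_columns('snp147'):
    print(col['name'], col['type'])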

There are a lot of columns, but the vast majority will automap just fine, so we don't need to do anything with them. However, we do need to declare a primary key, since none is detected here.

I actually just want to look up the SNP name by position, so I am going to explicitly declare a primary key (I don't care which column serves as the key, and some UCSC tables have primary keys that are not detected by the automapper) and also declare the columns I care about. The mapped class will end up with all of the table's columns, but declaring these few guarantees access to the ones I actually use.


In [8]:
# The automap base will detect all tables in the database and create classes for
# as many as it can. Many UCSC tables don't become classes because the primary key
# cannot be detected.
Base = automap_base()

class SNP147(Base):
    __tablename__ = 'snp147'
    
    name = Column(String(length=15), primary_key=True, nullable=False)
    
    # The following columns do not need to be declared; the automapper will fill
    # them in for us. I declare them anyway for my own reference.
    chrom      = Column(String(length=31), nullable=False)
    chromStart = Column(Integer, nullable=False)
    chromEnd   = Column(Integer, nullable=False)

# reflect the tables
Base.prepare(engine, reflect=True)
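
Tables without a detectable primary key are silently skipped by the automapper, so it can be useful to check what actually got mapped. A quick optional peek, not in the original session:

# Names of the tables the automapper turned into classes
print(len(Base.classes.keys()))
print(sorted(Base.classes.keys())[:10])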

In [9]:
session = Session(engine)

Query the table


In [10]:
session.query(SNP147.name).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd == 16952481).first()


Out[10]:
('rs915311')
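
Because the automapper filled in every other column, we can also pull the full mapped row and read any attribute off it. A sketch (the printed values will be whatever the table holds):

# Fetch the whole mapped row for one rsID and read a few automapped columns
snp = session.query(SNP147).filter(SNP147.name == 'rs915311').first()
print(snp.chrom, snp.chromStart, snp.chromEnd, snp.observed)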

In [11]:
positions = [
    154326279,
    11029552,
    241803636,
    59165838,
    39991588,
    204733046,
    16341354,
    16971948,
    154056834,
    9712006
]

In [12]:
q = session.query(SNP147.name).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd.in_(positions))

In [13]:
q.all()


Out[13]:
[('rs78800005'),
 ('rs74052099'),
 ('rs1048245'),
 ('rs2209174'),
 ('rs3738676'),
 ('rs11207297'),
 ('rs72694300'),
 ('rs4040747'),
 ('rs12034642'),
 ('rs7513575')]

In [14]:
df = pd.read_sql_query(q.statement, engine)

In [15]:
df


Out[15]:
name
0 rs78800005
1 rs74052099
2 rs1048245
3 rs2209174
4 rs3738676
5 rs11207297
6 rs72694300
7 rs4040747
8 rs12034642
9 rs7513575

In [16]:
df2 = pd.read_sql_query(
    session.query(SNP147.name, SNP147.chrom, SNP147.chromEnd).filter(SNP147.chrom == 'chr1').filter(SNP147.chromEnd.in_(positions)).statement,
    engine
)

In [17]:
df2


Out[17]:
name chrom chromEnd
0 rs78800005 chr1 9712006
1 rs74052099 chr1 11029552
2 rs1048245 chr1 16341354
3 rs2209174 chr1 16971948
4 rs3738676 chr1 39991588
5 rs11207297 chr1 59165838
6 rs72694300 chr1 154056834
7 rs4040747 chr1 154326279
8 rs12034642 chr1 204733046
9 rs7513575 chr1 241803636
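
Note that the query returns rows sorted by position here, not in the order of the original positions list. If you want the rsIDs back in the input order, a small pandas step like this works (a sketch using df2 from above):

# Reindex by position so the names line up with the original positions list
lookup = df2.set_index('chromEnd')['name'].reindex(positions)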

That's it, simple! Yes, you could query the MySQL database directly, but this approach is much more robust for on-the-fly data analysis: it is easy to get raw SQL syntax wrong, whereas composing queries with SQLAlchemy in Jupyter is trivial.
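
For comparison, here is roughly what the last lookup looks like as a hand-written SQL string (a sketch; the table and column names have to be kept in sync by hand, which is exactly the bookkeeping the ORM removes):

# Equivalent raw SQL passed straight to pandas (only two positions shown)
raw_df = pd.read_sql_query(
    "SELECT name, chrom, chromEnd FROM snp147 "
    "WHERE chrom = 'chr1' AND chromEnd IN (9712006, 11029552)",
    engine,
)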